Introduction and Summary
This project aims to create a comment classifier capable of assigning specific toxicity labels to text comments, such as insult, identity hate, and others.
Throughout the project, various experiments were conducted using pre-trained NLP models, including RoBERTa and DistilBERT. The best model achieved an AUC-PR of 0.693 and a macro-average F1-score of 0.67.
Imports:
Show the code
%reload_ext autoreload
%autoreload 1
import matplotlib.pyplot as plt
from sklearn.metrics import (
f1_score,
precision_score,
recall_score,
classification_report
)
from skmultilearn.model_selection.iterative_stratification import (
iterative_train_test_split,
)
from model.model import CommentClassifier, TunerDynamicUnfreeze
from model.data_loader import CommentDataModule
import auxiliary_functions.auxiliary_functions as aux
from transformers import AutoTokenizer
import pytorch_lightning as ptl
import torch
import joblib
import zipfile
import os
import polars as pl
import seaborn as sns
from matplotlib import ticker
from tabulate import tabulate
from IPython.display import Markdown, display
from langdetect import detect_langs
import numpy as np
from IPython.display import clear_output
from lime.lime_text import LimeTextExplainer
%aimport model.model
%aimport model.data_loader
%aimport auxiliary_functions.auxiliary_functions
Options:
Show the code
BASE_FIG_SIZE = (8.5, 4.5)
np.random.seed(1)
Data Loading
Downloading the data set:
Show the code
!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge -p data
Unzip and store files:
Show the code
zip_file_path = "data/jigsaw-toxic-comment-classification-challenge.zip"
with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall("data")
os.remove(zip_file_path)
zip_files = os.listdir("data")
for zip_name in zip_files:  # avoid shadowing the built-in zip()
    zip_path = f"data/{zip_name}"
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall("data")
    os.remove(zip_path)
Read the data:
Show the code
train_data = pl.read_csv("data/train.csv")
Exploratory Data Analysis
Number of samples in the training data:
Overview:
Show the code
shape: (5, 8)
id                 comment_text       toxic  severe_toxic  obscene  threat  insult  identity_hate
"0000997932d777…   "Explanation Wh…   0      0             0        0       0       0
"000103f0d9cfb6…   "D'aww! He matc…   0      0             0        0       0       0
"000113f07ec002…   "Hey man, I'm r…   0      0             0        0       0       0
"0001b41b1c6bb3…   "" More I can't…   0      0             0        0       0       0
"0001d958c54c6e…   "You, sir, are …   0      0             0        0       0       0
Show the code
toxicity_labels = train_data.columns[-6:]
Are there duplicate ID’s?
Show the code
train_data["id"].is_duplicated().any()
Are there duplicate comments?
Show the code
train_data["comment_text"].is_duplicated().any()
Class Balance
Percentage of comments with each label:
Show the code
fig_classes, ax_classes = plt.subplots(figsize=BASE_FIG_SIZE)
class_proportion = train_data[:, 2:].sum().to_numpy()[0] / len(train_data)
sns.barplot(x=class_proportion * 100, y=train_data.columns[2:], ax=ax_classes)
ax_classes.xaxis.set_major_formatter(ticker.PercentFormatter())
ax_classes.set_xlabel("Percentage of Comments with Label")
The toxicity labels are rare, with the smallest minority labels present in less than 1% of the comments.
Can comments have more than one label?
Show the code
label_sums = train_data[:, 2:].sum_horizontal().value_counts()
label_sums.columns = ["Total labels", "Number of Comments"]
label_sums.sort("Number of Comments")
shape: (7, 2)
Total labels  Number of Comments
6             31
5             385
4             1760
2             3480
3             4209
1             6360
0             143339
Comments can have more than one label.
Fraction of comments with no labels:
Show the code
round(
    (
        label_sums.filter(pl.col("Total labels") == 0)["Number of Comments"]
        / len(train_data)
    ).item(),
    2,
)
90% of the comments are benign.
Language Detection
Using a language detection model to predict the language in each comment:
Show the code
def detect_language(text):
    try:
        result = detect_langs(text)[0].lang
    except Exception:  # empty or undetectable text
        result = "empty"
    return result

languages = train_data["comment_text"].map_elements(detect_language)
joblib.dump(languages, "temp/languages.joblib")
Loading the predictions:
Show the code
languages = joblib.load("temp/languages.joblib")
Major non-English languages detected:
Show the code
train_data.with_columns(languages.alias("lang")).filter(pl.col("lang") != "en")[
    "lang"
].value_counts().sort("counts", descending=True).head()
shape: (5, 2)
lang  counts
"de"  571
"fr"  373
"af"  344
"so"  275
"id"  269
The language detection model predicts that a small number of comments may not be in English and would therefore require different NLP models.
Inspecting some of the predicted non-english comments:
Show the code
aux.table_display(
    train_data.with_columns(languages.alias("lang"))
    .filter(pl.col("lang") != "en")[["comment_text", "lang"]]
    .head(20),
    tablefmt="html",
)
comment_text                                                      lang
REDIRECT Talk:Voydan Pop Georgiev- Chernodrinski                  da
REDIRECT Talk:Frank Herbert Mason                                 de
Oh, it's me vandalising?xD See here. Greetings,                   af
Azari or Azerbaijani? Azari-iranian,azerbaijani-turkic nation.    sl
86.29.244.57|86.29.244.57]] 04:21, 14 May 2007                    tl
Future Perfect at Sunrise|☼]] 14:59, 16                           ro
REDIRECT Talk:José Manuel Rojas                                   et
Valerie Poxleitner Valeri Poxleitner, A.K.A. Lights. If           de
|listas = Manos Family                                            es
Barnes Aus 1 1 8                                                  de
06:15, 19 Aug 2004 (UTC)                                          de
P.S. Are you a /b/tard?                                           ca
" No problem at all! (talk) "                                     no
I've just seen that                                               nl
Batman I am Batman. You are Spiderman. I win.                     id
WikiDon, STOP stalking me!                                        af
Type 3 looks gorgeous ) (talk)                                    af
|listas = Schaefer, Nolan                                         de
"::I LOL'd hardest at J.delanoy's. P Cobra "                      ca
REDIRECT Talk:Jeopardy! (video games)                             et
Upon closer inspection of these comments, it is evident that most of them are in fact in English and were flagged otherwise due to the model's imperfections. Therefore, no comments will be omitted.
Modeling
Class interaction schema:
The data preprocessing and model training pipeline is built from several custom classes that inherit from PyTorch classes, as illustrated in the schema above. The parameters highlighted in red are adjustable through a configuration file, which is used to instantiate a dataloader, model, and trainer.
The parameters in the configuration file are as follows:
data: polars DataFrame or string directory
batch_size: Batch Size
model: Name of the pre-trained model passed to AutoModel.from_pretrained() and AutoTokenizer.from_pretrained()
tokenizer_max_len: Maximum length in the tokenizer to which tokenized text is padded or truncated
class_weights: balanced or None. Balanced weighs each label differently based on its frequency, assigning higher weights to rare labels.
learning_r: Initial learning rate
stop_patience: Number of epochs to continue training after the stop_delta limit is reached
stop_delta: Minimum decrease in validation loss between epochs; if the decrease falls below this value, training is stopped
unfreeze_delta: Minimum decrease in validation loss between epochs; if the decrease falls below this value, all model layers are unfrozen
tuning_lr: Learning rate of the model’s backbone
max_epochs: The maximum number of epochs if not stopped earlier
dropout: If not None, a float fraction corresponding to the dropout rate of a dropout layer before the final classification layer
under_sample: Use only this fraction of benign comments in the training data
train_frac: Use only a fraction of the training data
val_frac: Use only a fraction of the validation data
test_frac: Use only a fraction of the test data
name: Experiment name
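As a sketch of what the balanced option for class_weights might compute (the exact formula lives in the project's custom classes, so this is an assumption): each label's positive weight is the ratio of negative to positive samples, so rare labels contribute more to the loss.

```python
import numpy as np

def balanced_pos_weights(labels: np.ndarray) -> np.ndarray:
    """Per-label positive weights for a multi-label problem.

    labels: (n_samples, n_labels) binary matrix.
    Returns n_negatives / n_positives per label, the convention used by
    torch.nn.BCEWithLogitsLoss(pos_weight=...).
    """
    n_pos = labels.sum(axis=0)
    n_neg = labels.shape[0] - n_pos
    return n_neg / n_pos

# Toy example: the rarer second label gets a larger weight.
toy = np.array([[1, 0], [0, 0], [0, 0], [1, 1]])
print(balanced_pos_weights(toy))  # → [1. 3.]
```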
Load tensorboard:
Experiment 1: DistilBERT
The first experiment uses DistilBERT as a base model.
Set Experiment 1 config and train the model:
Show the code
config = {
    "data": train_data,
    "batch_size": 16,
    "model": "distilbert-base-uncased",
    "tokenizer_max_len": 150,
    "class_weights": "balanced",
    "learning_r": 1e-4,
    "stop_patience": 2,
    "stop_delta": 1e-5,
    "unfreeze_delta": 1e-4,
    "tuning_lr": 1e-5,
    "max_epochs": 100,
    "dropout": None,
    "under_sample": 0.1,
    "train_frac": 1,
    "val_frac": 0.3,
    "test_frac": 1,
    "name": "distilbert_bs16_lr5e-4_w_166maxl_undersampling0.1_testsmall",
}
data_loader, model, trainer = aux.set_model_from_config(config=config)
trainer.fit(model, datamodule=data_loader)
Evaluate the model:
Show the code
all_metrics = {}
test_loader = data_loader.test_dataloader()
preds_distilbert_01 = trainer.predict(model, test_loader)
true_labels_distilbert_01 = torch.Tensor(test_loader.dataset.labels.to_numpy())
preds_distilbert_01 = torch.cat(
    [batch.sigmoid() for batch in preds_distilbert_01], dim=0
)
clear_output()
metrics_disbert_v01 = aux.evaluate_model(
    true_labels=true_labels_distilbert_01,
    predictions=preds_distilbert_01,
    labels=train_data.columns[-6:],
    print_metrics=True,
)
all_metrics["distilbert_01"] = metrics_disbert_v01
Experiment 2: RoBERTa
In the hope that the added complexity of the RoBERTa model improves performance, this model is used next with the same hyperparameters.
Set Experiment 2 config and train the model:
Show the code
config_roberta_01 = {
    "data": train_data,
    "batch_size": 16,
    "model": "roberta-base",
    "tokenizer_max_len": 150,
    "class_weights": "balanced",
    "learning_r": 1e-4,
    "stop_patience": 2,
    "stop_delta": 1e-5,
    "unfreeze_delta": 1e-4,
    "tuning_lr": 1e-5,
    "max_epochs": 100,
    "dropout": None,
    "under_sample": 0.1,
    "train_frac": 1,
    "val_frac": 0.3,
    "test_frac": 1,
    "name": "roberta_bs16_lr1e-4_w_166maxl_undersampling0.1",
}
data_loader_roberta, model_roberta, trainer_roberta = aux.set_model_from_config(
    config=config_roberta_01
)
trainer_roberta.fit(model_roberta, datamodule=data_loader_roberta)
Evaluate the model:
Show the code
test_loader_roberta = data_loader_roberta.test_dataloader()
preds_roberta = trainer_roberta.predict(model_roberta, test_loader_roberta)
preds_roberta = torch.cat([batch.sigmoid() for batch in preds_roberta], dim=0)
true_labels_roberta = torch.Tensor(test_loader_roberta.dataset.labels.to_numpy())
clear_output()
metrics_roberta = aux.evaluate_model(
    predictions=preds_roberta,
    true_labels=true_labels_roberta,
    labels=train_data.columns[-6:],
    print_metrics=True,
)
all_metrics["roberta_01"] = metrics_roberta
Experiment 3: DistilBERT with a lower learning rate and batch size
Using a more complex model did not improve the result and increased the training time. Therefore, the next experiment returns to DistilBERT, this time with a lower learning rate and a smaller batch size.
Set Experiment 3 config and train the model:
Show the code
config_distilbert_02 = {
    "data": train_data,
    "batch_size": 8,
    "model": "distilbert-base-uncased",
    "tokenizer_max_len": 150,
    "class_weights": "balanced",
    "learning_r": 5e-5,
    "stop_patience": 2,
    "stop_delta": 1e-5,
    "unfreeze_delta": 1e-4,
    "tuning_lr": 5e-6,
    "max_epochs": 100,
    "dropout": None,
    "under_sample": 0.1,
    "train_frac": 1,
    "val_frac": 0.3,
    "test_frac": 1,
    "name": "distilbert_bs8_lr5e-5_w_166maxl_undersampling0.1",
}
(
    data_loader_distilbert_02,
    model_distilbert_02,
    trainer_distilbert_02,
) = aux.set_model_from_config(config=config_distilbert_02)
trainer_distilbert_02.fit(model_distilbert_02, data_loader_distilbert_02)
Evaluate the model:
Show the code
test_loader_distilbert_02 = data_loader_distilbert_02.test_dataloader()
preds_distilbert_02 = trainer_distilbert_02.predict(
    model_distilbert_02, test_loader_distilbert_02
)
preds_distilbert_02 = torch.cat(
    [batch.sigmoid() for batch in preds_distilbert_02], dim=0
)
true_labels_distilbert_02 = torch.Tensor(
    test_loader_distilbert_02.dataset.labels.to_numpy()
)
clear_output()
metrics_distilbert_02 = aux.evaluate_model(
    predictions=preds_distilbert_02,
    true_labels=true_labels_distilbert_02,
    labels=train_data.columns[-6:],
    print_metrics=True,
)
all_metrics["distilbert_02"] = metrics_distilbert_02
AUC-ROC: 0.9893
AUC-PR: 0.6927
Classification Report:
precision recall f1-score support
toxic 0.53 0.95 0.68 3059
severe_toxic 0.28 0.89 0.43 319
obscene 0.62 0.94 0.75 1690
threat 0.30 0.82 0.44 95
insult 0.60 0.90 0.72 1576
identity_hate 0.39 0.76 0.51 281
micro avg 0.53 0.93 0.68 7020
macro avg 0.45 0.88 0.59 7020
weighted avg 0.55 0.93 0.68 7020
samples avg 0.41 0.91 0.82 7020
Experiment 4: DistilBERT with an additional dropout layer
Lowering the learning rate and batch size slightly improved performance. Next, an extra dropout layer with a rate of 0.25 is added before the final classification layer to better regularize the model.
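Conceptually, such a layer implements inverted dropout. A minimal NumPy sketch (an illustration, not the actual PyTorch module used in the project):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x: np.ndarray, rate: float, training: bool = True) -> np.ndarray:
    """Inverted dropout: during training, zero a fraction `rate` of the
    activations and rescale the survivors by 1 / (1 - rate) so the
    expected activation is unchanged; at inference it is a no-op."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.ones((2, 4))
print(dropout(x, rate=0.25))                  # survivors become ~1.33, others 0
print(dropout(x, rate=0.25, training=False))  # unchanged at inference
```

Randomly silencing activations before the classification layer forces the head not to rely on any single feature, which is the regularization effect sought here.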
Set Experiment 4 config and train the model:
Show the code
config_distilbert_03 = {
    "data": train_data,
    "batch_size": 8,
    "model": "distilbert-base-uncased",
    "tokenizer_max_len": 150,
    "class_weights": "balanced",
    "learning_r": 5e-5,
    "stop_patience": 2,
    "stop_delta": 1e-5,
    "unfreeze_delta": 1e-4,
    "tuning_lr": 5e-6,
    "max_epochs": 100,
    "dropout": 0.25,
    "under_sample": 0.1,
    "train_frac": 1,
    "val_frac": 0.3,
    "test_frac": 1,
    "name": "distilbert_bs8_lr5e-5_w_166maxl_undersampling0.1_dropout025",
}
(
    data_loader_distilbert_03,
    model_distilbert_03,
    trainer_distilbert_03,
) = aux.set_model_from_config(config=config_distilbert_03)
trainer_distilbert_03.fit(model_distilbert_03, data_loader_distilbert_03)
Evaluate the model:
Show the code
test_loader_distilbert_03 = data_loader_distilbert_03.test_dataloader()
preds_distilbert_03 = trainer_distilbert_03.predict(
    model_distilbert_03, test_loader_distilbert_03
)
preds_distilbert_03 = torch.cat(
    [batch.sigmoid() for batch in preds_distilbert_03], dim=0
)
true_labels_distilbert_03 = torch.Tensor(
    test_loader_distilbert_03.dataset.labels.to_numpy()
)
clear_output()
metrics_distilbert_03 = aux.evaluate_model(
    predictions=preds_distilbert_03,
    true_labels=true_labels_distilbert_03,
    labels=train_data.columns[-6:],
    print_metrics=True,
)
all_metrics["distilbert_03"] = metrics_distilbert_03
Adding an extra dropout layer did not improve the performance of the model, but it did speed up training, causing the model to converge in fewer epochs.
Get the number of epochs for each model:
Show the code
all_metrics["distilbert_01"]["Training Epochs"] = trainer.callbacks[4].stopped_epoch - 2
all_metrics["roberta_01"]["Training Epochs"] = (
    trainer_roberta.callbacks[4].stopped_epoch - 2
)
all_metrics["distilbert_02"]["Training Epochs"] = (
    trainer_distilbert_02.callbacks[4].stopped_epoch - 2
)
all_metrics["distilbert_03"]["Training Epochs"] = (
    trainer_distilbert_03.callbacks[4].stopped_epoch - 2
)
Save the best model and metrics:
Show the code
joblib.dump(
    (model_distilbert_02, data_loader_distilbert_02, trainer_distilbert_02),
    "temp/models/distilbert_02.joblib",
)
joblib.dump(all_metrics, "temp/all_metrics.joblib")
joblib.dump((true_labels_distilbert_02, preds_distilbert_02), "temp/test_preds.joblib")
Load saved model and metrics:
Show the code
model_distilbert_02, data_loader_distilbert_02, trainer_distilbert_02 = joblib.load(
    "temp/models/distilbert_02.joblib"
)
all_metrics = joblib.load("temp/all_metrics.joblib")
true_labels_distilbert_02, preds_distilbert_02 = joblib.load("temp/test_preds.joblib")
Model Comparison
Due to label imbalance, the models are compared using the average area under the precision-recall curve (AUC-PR) across labels.
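This comparison metric can be reproduced with scikit-learn's average_precision_score, which approximates the area under the precision-recall curve; average="macro" weighs every label equally regardless of frequency. A sketch on toy data (not the project's aux.evaluate_model):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy multi-label ground truth and predicted probabilities (2 labels).
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
y_prob = np.array([[0.9, 0.2], [0.1, 0.8], [0.4, 0.7], [0.6, 0.3]])

# Macro average: AUC-PR per label, then an unweighted mean.
macro_auc_pr = average_precision_score(y_true, y_prob, average="macro")
print(round(macro_auc_pr, 3))  # → 0.708
```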
AUC-PR comparison between experiments:
Show the code
experiment_params = [
    "DistilBERT, Learning r = 5e-4, Batch size = 16,",
    "RoBERTa, Learning r = 5e-4, Batch size = 16,",
    "DistilBERT Learning r = 5e-5 Batch size = 8,",
    "DistilBERT Learning r = 5e-5 Batch size = 8, Dropout = 0.25,",
]
fig_auc_pr_compare, ax_aurc_pr_compare = plt.subplots(figsize=BASE_FIG_SIZE)
for i, metrics in enumerate(all_metrics.values()):
    sns.barplot(
        y=[i],
        x=[metrics["AUC-PR"]],
        orient="h",
        ax=ax_aurc_pr_compare,
        color=sns.color_palette()[0],
    )
    ax_aurc_pr_compare.annotate(
        f"{experiment_params[i]} Training Epochs: {metrics['Training Epochs']}",
        (0.631, i),
    )
ax_aurc_pr_compare.set_xlim(0.63, 0.695)
ax_aurc_pr_compare.set_yticks([])
ax_aurc_pr_compare.set_xlabel("Average Area Under the Precision Recall Curve")
plt.show()
The model from experiment 3 is chosen for further evaluation.
Model Evaluation
Plotting the classification metrics for each label with different decision thresholds and selecting optimal values:
Show the code
fig_thresholds, ax_thresholds = plt.subplots(
    2, 3, figsize=(BASE_FIG_SIZE[0], BASE_FIG_SIZE[1] * 1.5)
)
ax_thresholds = ax_thresholds.flatten()
f1s_all_labels = []
best_thresholds = []
for i, ax in enumerate(ax_thresholds):
    predicted_probs = preds_distilbert_02[:, i]
    true_labels = true_labels_distilbert_02[:, i]
    label = train_data.columns[-6:][i]
    thresholds = np.linspace(0.0, 1, num=50)
    precisions, recalls, f1_scores = [], [], []
    for threshold in thresholds:
        binary_preds = predicted_probs >= threshold
        precisions.append(
            precision_score(true_labels, binary_preds, zero_division=np.nan)
        )
        recalls.append(recall_score(true_labels, binary_preds, zero_division=np.nan))
        f1_scores.append(f1_score(true_labels, binary_preds, zero_division=np.nan))
    max_f1 = np.nanmax(f1_scores)
    f1s_all_labels.append(max_f1)
    best_thresholds.append(thresholds[np.nanargmax(f1_scores)])
    ax.plot(thresholds, precisions, label="Precision")
    ax.plot(thresholds, recalls, label="Recall")
    ax.plot(thresholds, f1_scores, label="F1")
    ax.set_xlabel("Threshold")
    ax.set_ylabel("Score")
    ax.set_title(f"{label}")
    ax.legend(loc="lower right")
    ax.set_ylim((0, 1.18))
    ax.annotate(
        f"Support: {int(true_labels.sum())}\nMax F1-score: {max_f1:.2f}",
        (0, 1.02),
    )
plt.tight_layout()
plt.show()
preds_updated_therholds = preds_distilbert_02 > torch.Tensor(best_thresholds)
The model heavily favours recall over a wide range of decision thresholds. To optimize for the F1-score (increasing precision with minimal loss of recall), the thresholds for each class therefore had to be increased.
Classification report with optimal thresholds:
Show the code
print(
    classification_report(
        true_labels_distilbert_02, preds_updated_therholds, zero_division=np.nan
    )
)
precision recall f1-score support
0 0.84 0.77 0.80 3059
1 0.42 0.69 0.52 319
2 0.83 0.82 0.83 1690
3 0.47 0.61 0.53 95
4 0.71 0.82 0.76 1576
5 0.52 0.61 0.56 281
micro avg 0.75 0.78 0.77 7020
macro avg 0.63 0.72 0.67 7020
weighted avg 0.77 0.78 0.77 7020
samples avg 0.71 0.70 0.85 7020
Classification metrics for a binary problem (benign vs toxic):
Show the code
print(
    classification_report(
        (true_labels_distilbert_02.sum(dim=1) > 0),
        preds_updated_therholds.sum(dim=1) > 0,
    )
)
precision recall f1-score support
False 0.97 0.99 0.98 28681
True 0.86 0.77 0.81 3232
accuracy 0.96 31913
macro avg 0.92 0.88 0.90 31913
weighted avg 0.96 0.96 0.96 31913
Model Interpretability
To gain insight into how the model makes its decisions, the LIME explanation package is used.
LIME explanations for several toxic comments:
Show the code
toxic_indices = (
    test_loader_distilbert_02.dataset.labels.with_columns(
        pl.Series(np.arange(len(test_loader_distilbert_02.dataset.labels))).alias(
            "index"
        )
    )
    .filter(test_loader_distilbert_02.dataset.labels.sum_horizontal() > 0)
    .sample(3, seed=1)["index"]
    .to_list()
)
for i in toxic_indices:
    text = test_loader_distilbert_02.dataset.texts[i]
    explainer = LimeTextExplainer(class_names=train_data.columns[-6:])
    explanation = explainer.explain_instance(
        text,
        lambda x: aux.predict_probabilities(x, model=model_distilbert_02),
        labels=np.arange(6),
    )
    print("True labels:")
    display(test_loader_distilbert_02.dataset.labels[i])
    print("Explanation:")
    explanation.show_in_notebook()
display("Explanations Complete")
shape: (1, 6)
toxic  severe_toxic  obscene  threat  insult  identity_hate
1      0             1        0       1       0
shape: (1, 6)
toxic  severe_toxic  obscene  threat  insult  identity_hate
1      0             0        0       0       0
shape: (1, 6)
toxic  severe_toxic  obscene  threat  insult  identity_hate
1      0             1        0       1       0
The explainer highlights the most important words in each comment for a specific label. It is even more interesting, however, to see the interpretations in cases where mistakes were made.
Model explanation for mistaken benign comment:
Show the code
error_index = (
    test_loader_distilbert_02.dataset.labels.with_columns(
        pl.Series(np.arange(len(test_loader_distilbert_02.dataset.labels))).alias(
            "index"
        )
    )
    .filter(
        (test_loader_distilbert_02.dataset.labels.sum_horizontal() == 0)
        & pl.Series(preds_updated_therholds.sum(dim=1).numpy() > 0)
    )
    .sample(1, seed=1)["index"]
    .item()
)
text = test_loader_distilbert_02.dataset.texts[error_index]
explanation = explainer.explain_instance(
    text,
    lambda x: aux.predict_probabilities(x, model=model_distilbert_02),
    labels=np.arange(6),
)
print("True labels:")
display(test_loader_distilbert_02.dataset.labels[error_index])
print("Explanation:")
explanation.show_in_notebook()
shape: (1, 6)
toxic  severe_toxic  obscene  threat  insult  identity_hate
0      0             0        0       0       0
It is evident that certain keywords strongly override any value of context in the comment.
Model explanation for missed toxic comment:
Show the code
error_index2 = (
    test_loader_distilbert_02.dataset.labels.with_columns(
        pl.Series(np.arange(len(test_loader_distilbert_02.dataset.labels))).alias(
            "index"
        )
    )
    .filter(
        (test_loader_distilbert_02.dataset.labels.sum_horizontal() > 0)
        & pl.Series(preds_updated_therholds.sum(dim=1).numpy() == 0)
    )
    .sample(1, seed=2)["index"]
    .item()
)
text = test_loader_distilbert_02.dataset.texts[error_index2]
explanation = explainer.explain_instance(
    text,
    lambda x: aux.predict_probabilities(x, model=model_distilbert_02),
    labels=np.arange(6),
)
print("True labels:")
display(test_loader_distilbert_02.dataset.labels[error_index2])
print("Explanation:")
explanation.show_in_notebook()
shape: (1, 6)
toxic  severe_toxic  obscene  threat  insult  identity_hate
1      0             0        0       0       0
Misclassification of toxic comments seems to be less of a problem, as the predicted toxicity probabilities were still high; a higher recall can thus be achieved by lowering the thresholds if needed.
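The threshold-recall tradeoff can be illustrated on hypothetical probabilities (the numbers below are invented for the example, not the model's outputs):

```python
import numpy as np

# Eight comments, three of them toxic, with made-up predicted probabilities.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.45, 0.35, 0.4, 0.2, 0.1, 0.05, 0.02])

def precision_recall(threshold: float):
    """Precision and recall of thresholded predictions."""
    pred = y_prob >= threshold
    tp = np.sum(pred & (y_true == 1))
    precision = tp / max(pred.sum(), 1)
    recall = tp / (y_true == 1).sum()
    return float(precision), float(recall)

print(precision_recall(0.5))  # high precision, low recall
print(precision_recall(0.3))  # lower precision, full recall
```

Because the missed toxic comments still score relatively high, a modest threshold reduction recovers them at a limited precision cost.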
Kaggle Score
The following cells were used to make a late submission to Kaggle, which achieved a score of 0.979.
Loading Test Data:
Show the code
test_data = pl.read_csv("data/test.csv")
for i in range(6):
    test_data = test_data.with_columns(pl.zeros(len(test_data)).alias(str(i)))
Making predictions:
Show the code
test_dataset = CommentDataModule.CommentDataset(
    test_data, test_loader_distilbert_02.dataset.tokenizer, 166
)
testset_loader = torch.utils.data.DataLoader(test_dataset, batch_size=8, num_workers=0)
model_distilbert_02.eval()
test_predicts = trainer_distilbert_02.predict(model_distilbert_02, testset_loader)
test_predicts = torch.cat([batch.sigmoid() for batch in test_predicts], dim=0)
clear_output()
Exporting in submission format:
Show the code
submission = pl.concat(
    [
        pl.DataFrame(test_data["id"]),
        pl.DataFrame(test_predicts.numpy(), schema=toxicity_labels),
    ],
    how="horizontal",
)
submission.write_csv("temp/test_submission.csv" )
Conclusions
The model achieves comment toxicity labeling with an AUC-PR of 0.69.
The model prioritizes recall; achieving high precision is more challenging.
Mistakes made by the model on benign comments may stem from its difficulty in accurately evaluating context.
Further Improvements:
Increase the amount of training data, including a greater number of benign comments, to enhance context detection.
Experiment with various tokenizer max-length parameter values to optimize performance.
Explore different base models for the underlying architecture.
Enhance the training process by incorporating techniques such as layer unfreezing and learning rate scheduling.
2.2 Comment Length
Comment Length Distribution:
Show the code
Most of the comments have fewer than 1,000 characters, and lengths are capped at 5,000 characters.
Filtering out comments with no letters:
Show the code
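The cell above is collapsed in this rendering; a minimal sketch of such a filter using Python's re module (the project itself operates on a polars DataFrame, so the names here are illustrative):

```python
import re

comments = ["Hello there!", "12345", ":-) !!!", "ok 123"]

# Keep only comments that contain at least one letter.
has_letter = re.compile(r"[a-zA-Z]")
filtered = [c for c in comments if has_letter.search(c)]
print(filtered)  # → ['Hello there!', 'ok 123']
```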
Number of tokens:
Show the code
To determine the number of tokens in each comment, the comments were tokenized using the BERT uncased tokenizer. Toxic comments appear to have a lower median length than benign comments.
90th percentile of tokenized comment length:
Show the code
To strike a balance between the model's comprehension of long-range token relationships and processing time, the max_length parameter of the tokenizers will be set to the 90th percentile of the lengths of toxic comments.
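The percentile computation can be sketched as follows. Whitespace splitting stands in for the bert-base-uncased tokenizer here, and the comments are invented, so the numbers are purely illustrative:

```python
import numpy as np

# Stand-in for BERT tokenization: whitespace splitting.
comments = [
    "you are great",
    "this is a much longer comment with many more tokens in it",
    "short",
    "another fairly typical comment here",
]
token_counts = [len(c.split()) for c in comments]

# The tokenizer's max_length would be set to this value; comments longer
# than it are truncated, shorter ones padded.
max_length = int(np.percentile(token_counts, 90))
print(max_length)
```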